Abstract

In previous works it was shown that protein 3D-conformations can be encoded into discrete
sequences called dominance partition sequences (DPS), which generate a linear partition of
molecular conformational space into regions (cells) of molecular conformations that have the
same DPS. In this work we describe procedures for building, in a cubic lattice, the set of
3D-conformations that are compatible with a given DPS. Furthermore, this set can be structured
as a graph upon which a combinatorial algorithm can be applied for computing the mean
energy of the conformations in a cell.



1 Introduction 

In previous papers [1-5] we have built a series of mathematical tools for studying the
multidimensional molecular conformational space of proteins, with the aim of understanding the
dynamical states of proteins by building a complete energy surface.

In this approach, the 3D-structures of protein molecules are encoded into linear sequences of
numbers called dominance partition sequences (DPS) [1-5]; there are three of these sequences,
one for each coordinate x, y and z. For a molecule of N atoms, each atom is assigned a number
in the range 1 to N; then, for a coordinate c, the DPS is the sequence of atom numbers sorted in
ascending order of the value of the c-coordinate of their respective atoms.

A typical 3D-DP sequence may be something like 

{{(5)(3)(1)(4)(2)}_x, {(2)(4)(3)(1)(5)}_y, {(1)(3)(5)(2)(4)}_z}

which means that

x_5 < x_3 < x_1 < x_4 < x_2,   y_2 < y_4 < y_3 < y_1 < y_5,   z_1 < z_3 < z_5 < z_2 < z_4
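
The encoding can be illustrated with a minimal Python sketch that sorts atom numbers along each coordinate. The function name `dominance_partition_sequences` and the sample coordinates are hypothetical, chosen only so that the result reproduces the five-atom example; how ties between equal coordinates are ordered is not specified in the text and is left here to the stable sort.

```python
def dominance_partition_sequences(coords):
    """Return the DPS for x, y and z as lists of 1-based atom numbers.

    coords[i - 1] is the (x, y, z) position of atom number i.
    """
    dps = {}
    for axis, name in enumerate("xyz"):
        # Sort atom numbers by ascending value of the chosen coordinate.
        dps[name] = sorted(range(1, len(coords) + 1),
                           key=lambda i: coords[i - 1][axis])
    return dps


# Hypothetical coordinates consistent with the five-atom example.
coords = [
    (3.0, 4.0, 1.0),  # atom 1
    (5.0, 1.0, 4.0),  # atom 2
    (2.0, 3.0, 2.0),  # atom 3
    (4.0, 2.0, 5.0),  # atom 4
    (1.0, 5.0, 3.0),  # atom 5
]
dps = dominance_partition_sequences(coords)
```

Any other coordinate set with the same orderings along x, y and z yields the same three sequences, which is what makes the DPS a label for a whole cell of conformations.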

A simple example of DPS can be extracted from Fig. 1, where the α-carbon skeleton of the pancreatic
trypsin inhibitor protein [6,7] is shown with the atom positions numbered.

Figure 1: Numbered α-carbon chain stereoview of the pancreatic trypsin inhibitor protein.

The DPSs for this protein conformation are:

{{(58)(49)(29)(48)(57)(27)(28)(31)(30)(52) (32)(47)(53)(50)(19)(26)(21)(56)(51)(24)
  (33)(20)(23)(55)(46)(25)(22)(34)(54)(18) (1)(45)(17) (5) (6)(35)(44) (2) (8)(43)
  (16) (9)(11) (7) (3)(10)(36) (4)(37)(42) (15)(12)(41)(40)(14)(38)(13)(39)}_x,

 {(15)(16)(17)(14)(18)(37)(36)(13)(19)(38) (34)(35)(12)(11)(39)(20)(33)(46)(10)(40)
  (32)(47)(21)(45)(44) (9)(48)(31)(41)(22) (49)(50)(43)(42) (8)(51)(30)(23)(24)(52)
  (7)(29)(54)(53) (5)(27)(55)(26)(25) (4) (6)(28)(57)(56)(58) (3) (2) (1)}_y,

 {(26)(27)(10) (8)(25) (7)(24)(11) (6) (9) (12)(28)(13)(33)(34)(15)(31)(32)(29)(17)
  (14)(23)(36)(41)(35) (3)(40) (5)(22)(16) (30) (4)(39)(43) (1)(18)(21)(19)(42)(20)
  (44)(38) (2)(37)(55)(48)(45)(51)(52)(57) (56)(47)(46)(54)(49)(53)(58)(50)}_z}      (1)

The main tool developed [3-5] in this approach can be described as a fluctuation amplifier: the
small movements of a molecular system, which are essentially what current computer simulation
tools sample, are encoded by means of a simple combinatorial structure, from which we can
generate the complete set of DPSs corresponding to realizable 3D-conformations that arise from
the combination of these movements. In the preceding papers [3-5] it was described how to build
a graph whose nodes are the cells that are visited by the system in its thermal wandering, with
edges towards the adjacent cells.

As was suggested in [1], molecular 3D-conformations are constrained to a small fraction of
the cell volume; for this formalism to be useful, the conformational volume inside a cell has to be
probed and the mean energy of its conformations must be evaluated. The problem addressed in the
present work is how to reconstruct, from a DPS code, the set of 3D-conformations it encodes.



2 A procedure for embedding molecular conformations in a 
cubic 3D spatial lattice 

We start by describing a procedure for embedding the molecular 3D-structures in a cubic lattice;
this can be done using empirical data sampled from molecular dynamics simulations [7].

Procedure 1 

1. First we determine the dimensions of the lattice by taking as reference the mean bond length
between Cα carbons, which is 3.86 Å within the range 3.58 Å < 3.86 Å < 4.13 Å. In the example
developed here we set this length arbitrarily to 20 lattice units, which gives a lattice spacing
of 0.19 Å.

2. Given the range of variation extracted from molecular dynamics, any segment between two lattice
points with a length between 3.58 × 20/3.86 and 4.13 × 20/3.86 lattice units (about 18.55 and
21.40) is potentially a Cα-Cα bond segment. The set of valid lattice bond segments, modulo a
lattice translation along the x, y and z axes, is the set of segments starting at the origin and
ending in any lattice point that lies between two spheres of radius 3.58 × 20/3.86 and
4.13 × 20/3.86 respectively. This gives a total of 1883 primary segments, i.e. segments modulo a
reflection through the xy, xz and yz planes.

3. Next we determine the range of variation for the bond angles, which is greater than that for
the bond length and varies considerably along the Cα chain. For each bond angle A_{α1α2α3} along
the main chain we determine two integer numbers: the floored minimum ⌊min(A_{α1α2α3})⌋ and
the ceiled maximum ⌈max(A_{α1α2α3})⌉ of its range. These divide the range 0°-360° into
a number of intervals, which in our case are delimited by

71°-74°-75°-76°-77°-78°-79°-80°-81°-82°-87°-89°-
90°-92°-93°-94°-95°-96°-97°-98°-99°-100°-101°-
103°-104°-105°-106°-107°-108°-109°-110°-112°-
113°-114°-115°-116°-117°-118°-119°-120°-121°-
124°-125°-127°-129°-135°-136°-138°-139°-143°-
144°-147°-148°-149°-150°-151°-152°-153°-154°-
155°-156°-157°-159°-162°-163°-167°      (2)

Each bond angle along the backbone has a range spanning a given set of intervals from (2).

4. The allowed bond angles formed by pairs of primary segments are classified according
to the following:

(a) Their sign vector: from a broken line formed by two primary segments with end
coordinates v_1 = {x_0, y_0, z_0} and v_2 = {x_1, y_1, z_1}, a set of three sign vectors can be
generated

{sign(-x_0), sign(-x_1), sign(x_0 - x_1)}
{sign(-y_0), sign(-y_1), sign(y_0 - y_1)}
{sign(-z_0), sign(-z_1), sign(z_0 - z_1)}

where the function sign(x) returns the symbols +, - and 0 if x is positive, negative or
zero¹, respectively. In order to reduce the size of the data we introduce the constraint that the
first three signs of each vector must be +; from these all other sign classes can be generated by
reflection symmetry through the planes perpendicular to x, y and z. A total of 328 sign
classes are thus generated.

(b) Their bond angle interval : pairs of primary bond segments within a sign class are sorted 
in angular interval subclasses according to their bond angle. 

5. Next lattice embedded molecular 3D-conformations of the protein backbone can be generated 
by joining lattice bond segments with the correct angular interval and the correct sign matrix 
between the atoms. 
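
The building blocks of Procedure 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: "modulo a reflection" is taken to mean keeping end points with non-negative coordinates, the shell boundaries are treated as inclusive, and the bond angle is measured at the middle atom. All function and variable names are hypothetical, and whether the enumeration reproduces the reported count of 1883 depends on boundary conventions that the text does not specify.

```python
import math
from itertools import product

# Values quoted in Procedure 1; the names are illustrative.
D_MIN, D_MEAN, D_MAX = 3.58, 3.86, 4.13   # C-alpha to C-alpha distance, angstroms
UNITS = 20                                 # lattice units assigned to the mean length
R_MIN = D_MIN * UNITS / D_MEAN             # inner sphere radius, about 18.55
R_MAX = D_MAX * UNITS / D_MEAN             # outer sphere radius, about 21.40


def primary_segments():
    """Lattice vectors with non-negative components lying between the two spheres."""
    top = math.floor(R_MAX)
    return [v for v in product(range(top + 1), repeat=3)
            if R_MIN ** 2 <= sum(c * c for c in v) <= R_MAX ** 2]


def sign(t):
    """Return '+', '-' or '0' for a positive, negative or zero argument."""
    return "+" if t > 0 else "-" if t < 0 else "0"


def sign_vectors(v1, v2):
    """Sign vectors of the broken line origin -> v1 -> v2, one per coordinate."""
    return tuple((sign(-a), sign(-b), sign(a - b)) for a, b in zip(v1, v2))


def bond_angle_deg(s1, s2):
    """Angle at the middle atom between consecutive segments s1 and s2, in degrees."""
    dot = -sum(a * b for a, b in zip(s1, s2))          # (-s1) . s2
    cos = dot / (math.hypot(*s1) * math.hypot(*s2))    # math.hypot with 3 args: Python 3.8+
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))


segments = primary_segments()
```

Pairs of segments can then be grouped by `sign_vectors` and binned by `bond_angle_deg` against the interval limits in (2).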



3 The graph of lattice points 

Each bond between two atoms in the molecular backbone can be approximated by a lattice primary
segment, and each pair of consecutive bonds can be approximated by a pair of segments having
an angle within the bond angle dynamic range (2). In our approach molecular backbone
3D-conformations are characterized by the dynamic range of their bond lengths and bond angles, and
by the sign matrix or, equivalently, the dominance partition sequence. We have seen in the previous
section how to embed the molecular backbone in a discrete lattice; the problem we try to solve here
is how to enumerate the finite set of lattice 3D-conformations that fulfill the DPS constraints of a
cell or set of cells.

1 The sign matrix and DPSs are equivalent, since sign(x_0 - x_1) = + means that x_0 > x_1 [1].

For this we need to build a graphical structure, which we call the graph of lattice points, in
the following steps:

Procedure 2 

1. We arbitrarily set the coordinates of the first Cα (the root node) in the backbone to {0, 0, 0}
to avoid translation ambiguities.

2. We choose all segment pairs that have the same sign vectors as the first three Cαs and an
angular value within the interval limits allowed for the first bond angle.

3. The (n + 1)th lattice bond level in the Cα backbone is built from each individual nth-level
bond segment by joining it with the second segments of those lattice bond pairs for which:

(a) the first segment is the same as the nth bond,

(b) the angular value of the pair is within the range of the (n + 1)th bond angle.

4. For each level the nodes of the graph are the lattice points at the upper end of a bond segment.
There are two arcs between any two points in consecutive levels that are connected by a bond
segment: a forward arc from the node in the nth level towards the one in the (n + 1)th level,
and a reverse arc from the (n + 1)th level towards the nth level between the same two nodes.

5. The root node has an arc towards every node in the upper level and reciprocally.
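
A minimal Python sketch of this level-by-level construction is given below, under simplifying assumptions. The predicate `pair_ok` is a hypothetical stand-in for the sign-vector and angle-interval tests, and the first level simply accepts every segment rather than filtering on the first allowed pairs. The toy input (six unit segments, right-angle turns only) is illustrative and is not the 20-unit lattice of Procedure 1.

```python
def add(p, s):
    """Translate lattice point p by segment s."""
    return tuple(a + b for a, b in zip(p, s))


def build_lattice_graph(segments, n_bonds, pair_ok):
    """Build the levels and arcs of a layered graph of lattice points.

    segments : iterable of integer 3-tuples (candidate bond segments)
    n_bonds  : number of bonds, i.e. number of levels above the root
    pair_ok  : pair_ok(n, s_prev, s_next) -> bool (hypothetical predicate)
    """
    root = (0, 0, 0)
    levels = [{root}]
    forward = []                # forward[n][p] = points of level n+1 reached from p
    reverse = []                # reverse[n][q] = points of level n that reach q
    frontier = {(root, None)}   # (lattice point, incoming segment)
    for n in range(n_bonds):
        fwd, rev, nxt = {}, {}, set()
        for point, s_prev in frontier:
            for s in segments:
                # The pair test applies only once a previous segment exists.
                if s_prev is not None and not pair_ok(n, s_prev, s):
                    continue
                q = add(point, s)
                fwd.setdefault(point, set()).add(q)
                rev.setdefault(q, set()).add(point)
                nxt.add((q, s))
        levels.append({q for q, _ in nxt})
        forward.append(fwd)
        reverse.append(rev)
        frontier = nxt
    return levels, forward, reverse


# Toy input: six axis-aligned unit segments, only right-angle turns allowed.
UNIT = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def right_angle(n, s_prev, s_next):
    """Accept a pair of segments only if they are perpendicular."""
    return sum(a * b for a, b in zip(s_prev, s_next)) == 0


levels, forward, reverse = build_lattice_graph(UNIT, 2, right_angle)
```

The sketch tracks the incoming segment only while building; the returned nodes are plain lattice points per level, as in step 4.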

4 An algorithm for determining the set of inter-atomic distances
and weights

In force fields currently used in molecular dynamics simulations [8], atoms are represented by
point-like structures and the conformational energy is calculated with a Hamiltonian which is a
sum of two kinds of terms: a) local terms, which involve groups of 2, 3 and 4 consecutive atoms and
are needed to calculate the bond, bond angle and torsion angle energies; b) non-local terms, which
involve pairs of atoms lying anywhere in the structure and are needed for calculating the
electrostatic and van der Waals energy terms.

Usually the energy is calculated for only one conformation at a time. In this work, by contrast, we
intend to calculate the energy for a great number of structures simultaneously; as each term in the
Hamiltonian may appear in many structures, we want to calculate it only once. For this we need to
know in how many 3D-structures, or equivalently in how many paths in the graph of lattice points,
each term arises. For the non-local terms we further need to enumerate the whole set of inter-atomic
distances between pairs of atoms; this is possible on a lattice because squared distances are
integer numbers.

We start by describing an algorithm that assigns two weights to each node: for a given node
n the lower weight (LW_n) is the number of downward paths starting at the node and ending at
the root node, and the upper weight (UW_n) is the number of upward paths starting at the node and
ending at some node in the top level. The algorithm is as follows:

Procedure 3



1. For the root node we set its lower weight value to 1. 

2. We go to the next level. 

3. For each node n in the level we set the LW_n value to the sum of the LWs of the nodes reached
by its downward links.

4. If the top level has not been reached we go to step 2.

5. Otherwise for every node in the top level we set the upper weight to 1.

6. We go to the previous level.

7. For each node n in the level we set the UW_n value to the sum of the UWs of the nodes reached
by its upward links.

8. If the root level has not been reached we go to step 6.

9. Otherwise we terminate the procedure. 
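
Procedure 3 amounts to one upward and one downward sweep over the levels. The Python sketch below assumes a hypothetical representation in which `levels` lists the node labels of each level (root first) and `down[n]` lists the nodes of the previous level linked to n. The five-node graph is a toy example, not data from this work.

```python
def path_weights(levels, down):
    """Return (lw, uw): numbers of downward and upward paths for every node."""
    # Upward sweep (steps 1-4): the root has LW = 1, every other node sums
    # the LWs of the nodes reached by its downward links.
    lw = {levels[0][0]: 1}
    for level in levels[1:]:
        for n in level:
            lw[n] = sum(lw[m] for m in down[n])

    # Invert the downward links to obtain the upward links.
    up = {}
    for n, lowers in down.items():
        for m in lowers:
            up.setdefault(m, []).append(n)

    # Downward sweep (steps 5-8): top nodes have UW = 1, every other node
    # sums the UWs of the nodes reached by its upward links.
    uw = {n: 1 for n in levels[-1]}
    for level in reversed(levels[:-1]):
        for n in level:
            uw[n] = sum(uw[m] for m in up.get(n, []))
    return lw, uw


# Toy graph: root r, middle level {a, b}, top level {c, d}.
levels = [["r"], ["a", "b"], ["c", "d"]]
down = {"a": ["r"], "b": ["r"], "c": ["a", "b"], "d": ["b"]}
lw, uw = path_weights(levels, down)
```

In this toy graph the upper weight of the root and the sum of the lower weights over the top level both equal 3, the number of root-to-top paths.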

Thus two consecutive lattice points a and b will contribute the quantity E_bond(d_{a,b}) × LW_a × UW_b
to the global bond energy, where d_{a,b} is the length of the lattice segment between a and b, and LW_a
and UW_b are the lower and upper weights of a and b respectively. Similarly, for the bond angle and
torsion energy of consecutive lattice points a, b, c and d we have E_angle(θ_{a,b,c}) × LW_a × UW_c and
E_torsion(φ_{a,b,c,d}) × LW_a × UW_d respectively. These quantities can be calculated using a variant of
the basic algorithm described in Procedure 3.
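
The weighted bond term can be checked on a small example by comparing it with an explicit sum over all root-to-top paths. In the Python sketch below the graph, the lattice positions and the harmonic form of `e_bond` (with constants `k` and `d0`) are all hypothetical; the weights are computed recursively rather than level by level, which gives the same numbers on a layered graph.

```python
import math
from functools import lru_cache

# Toy layered graph (hypothetical data): node -> nodes in the next level up.
UP = {"r": ("a", "b"), "a": ("c",), "b": ("c", "d"), "c": (), "d": ()}
POS = {"r": (0, 0, 0), "a": (20, 0, 0), "b": (0, 19, 0),
       "c": (20, 19, 0), "d": (0, 19, 21)}
DOWN = {n: tuple(m for m in UP if n in UP[m]) for n in UP}


@lru_cache(maxsize=None)
def lw(n):
    """Number of downward paths from n to the root."""
    return 1 if not DOWN[n] else sum(lw(m) for m in DOWN[n])


@lru_cache(maxsize=None)
def uw(n):
    """Number of upward paths from n to the top level."""
    return 1 if not UP[n] else sum(uw(m) for m in UP[n])


def e_bond(d, k=1.0, d0=20.0):
    """Hypothetical harmonic bond energy; k and d0 are illustrative."""
    return 0.5 * k * (d - d0) ** 2


def weighted_bond_energy():
    """Sum over arcs of E_bond(d_ab) * LW_a * UW_b."""
    return sum(e_bond(math.dist(POS[a], POS[b])) * lw(a) * uw(b)
               for a in UP for b in UP[a])


def paths(node):
    """Yield every upward path from node to the top level."""
    if not UP[node]:
        yield [node]
    for nxt in UP[node]:
        for rest in paths(nxt):
            yield [node] + rest


def brute_force_bond_energy():
    """Sum of bond energies over every path, one conformation at a time."""
    return sum(e_bond(math.dist(POS[a], POS[b]))
               for p in paths("r") for a, b in zip(p, p[1:]))


total = weighted_bond_energy()
n_paths = uw("r")
mean_bond_energy = total / n_paths
```

Dividing the weighted total by the number of paths gives the mean bond energy over all conformations in the graph, with each arc evaluated only once.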

To determine the weights for distances between arbitrary pairs of nodes we need three new data
structures: an integer flag variable in every node, a sorted table of distances and a register array
whose length is the number of levels.

Procedure 4 

1. We set the flag in every node to 0.

2. We consider the N_top nodes in the top level.

3. For every node in the top level

(a) We assign to each one a number n_top from 1 to N_top.

(b) We enter the node in the top level of the array register.

(c) We follow every downward link in succession to the previous level.

4. For each node thus reached

(a) We set its flag to the current top node number n_top and we enter the node in the
corresponding level of the array register.

(b) We follow every downward link in succession.

(c) Upon reaching the root level or a node whose flag value is equal to the current n_top

i. We calculate the distances between all nodes in the array register from the current
to the top level.

ii. Every calculated distance d_{a,b} between nodes a and b in levels L_a > L_b is assigned an
upper and a lower weight: UW_a and LW_b. The array {d_{a,b}, LW_b, UW_a} is searched
in the table of distances

A. If not found it is entered in the table together with the number count_{a,b}, which
is set to 1.

B. If the array already exists in the table count_{a,b} is increased by 1.

iii. We return to the previous node and from there we follow the next downward link
to a node in the lower level.

(d) Upon reaching a node in level L_v whose flag value n is different from the current n_top

i. We continue the downward exploration of the graph setting the flag of every node
to n_top.

ii. Upon reaching the root node we do not compute distances between nodes in
levels L_v or lower.

5. The procedure terminates when all downward links in every top node have been explored.

Each distance d_{a,b} from the table of distances will have a weight W_{a,b} = UW_a + LW_b + count_{a,b},
which will be used to compute the van der Waals and electrostatic terms of the Hamiltonian,
E_vdw(d_{a,b}) × W_{a,b} and E_elec(d_{a,b}) × W_{a,b} respectively.
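
For small graphs, a table of distances can be cross-checked by brute force. The Python sketch below is not Procedure 4: it enumerates every root-to-top path explicitly and tallies, for each integer squared distance, the number of (path, atom pair) occurrences. It does not compute the weights W_{a,b}. The toy graph and positions are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Toy layered graph (hypothetical data): node -> nodes in the next level up.
UP = {"r": ("a", "b"), "a": ("c",), "b": ("c", "d"), "c": (), "d": ()}
POS = {"r": (0, 0, 0), "a": (20, 0, 0), "b": (0, 19, 0),
       "c": (20, 19, 0), "d": (0, 19, 21)}


def paths(node):
    """Yield every upward path from node to the top level."""
    if not UP[node]:
        yield [node]
    for nxt in UP[node]:
        for rest in paths(nxt):
            yield [node] + rest


def squared_distance(p, q):
    """Squared Euclidean distance; an integer for lattice points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def distance_table(root):
    """Count how often each squared distance occurs over all paths and atom pairs."""
    table = Counter()
    for path in paths(root):
        for a, b in combinations(path, 2):
            table[squared_distance(POS[a], POS[b])] += 1
    return table


table = distance_table("r")
```

Because the keys are integers, equal distances arising from different node pairs fall into the same table entry without any floating-point comparison.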

5 Discussion 

The procedures discussed in this work make it possible to remove two important hurdles that the
formalism faces on its way towards practical applications:

1. The construction of realistic 3D-conformations with a given partition sequence. As was
discussed in [1], a lot of structural information disappears when replacing the coordinates of a
molecule by the inverse sequences of its DPSs; however, although the molecule appears heavily
deformed, all the secondary structure motifs (α-helices, β-sheets, turns, ...) together with the
overall 3D folding arrangement can still be recognized. This means that the 3 × (N - 1)-dimensional
volume² of a cell in conformational space is big enough to allow for very lean
codes, but it still exerts a constraint that keeps the 3D-structures close to the real ones.

2. The combinatorial structures described above for building molecular conformations can be
transformed to calculate their mean energy, which is essential for comparing the results from
this formalism with experimental data.

It is also quite conceivable that in some practical situations the procedures described above may
be overwhelmed by the amount of data generated; in this case the combinatorial structures they
generate will have to be pruned in order to make them useful. This question will be addressed
when submitting this model of conformational molecular space to phenomenological tests.



2 For a molecule with N atoms. 